StatLib library which is maintained at Carnegie Mellon University. The dataset has the following fields:
* mpg: miles per gallon
* cylinders: number of cylinders
* displacement: engine displacement (cu. inches)
* horsepower: engine horsepower
* weight: vehicle weight (lbs.)
* acceleration: time to accelerate from 0 to 60 mph (sec.)
* year: model year (modulo 100)
* origin: origin of car (1. American, 2. European, 3. Japanese)
* name: vehicle name
mpg as the response and horsepower as the predictor. Comment on why you need to change the horsepower variable before performing the regression.
Answer. The horsepower variable is read in as a character (in older versions of R, a factor) variable due to the presence of "?" entries marking missing values. We need to convert it to numeric, as shown in the code, before a sensible model can be fit.
auto<-read.csv("Auto.csv")
str(auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : chr "130" "165" "150" "150" ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
auto$horsepower <- as.numeric(as.character(auto$horsepower))
## Warning: NAs introduced by coercion
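As a side note, the coercion step can be avoided by declaring the missing-value marker when the file is read. A minimal sketch (the `hp_raw` vector below is a made-up stand-in for the raw column):

```r
# "?" marks missing values, so coercing through character to numeric
# turns those entries into NA.
hp_raw <- c("130", "165", "?", "150")          # hypothetical raw column
hp_num <- suppressWarnings(as.numeric(hp_raw)) # "?" becomes NA
hp_num
# Alternatively, declare the marker when reading the file so the column
# arrives numeric in the first place:
# auto <- read.csv("Auto.csv", na.strings = "?")
```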
model1<- lm(mpg~horsepower, data=auto)
summary(model1)
##
## Call:
## lm(formula = mpg ~ horsepower, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
predict.lm function on how the confidence interval can be computed for predictions with R.
predict.lm(model1, newdata=data.frame(horsepower=98), interval="confidence", level=.99)
## fit lwr upr
## 1 24.46708 23.81669 25.11747
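For comparison, `interval = "prediction"` gives the (wider) interval for an individual response rather than for the mean response. A quick sketch on the built-in mtcars data, since Auto.csv is not needed to illustrate the point:

```r
# Prediction intervals also account for the irreducible error term, so
# they are wider than confidence intervals at the same predictor value.
fit <- lm(mpg ~ hp, data = mtcars)
new <- data.frame(hp = 98)
conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.99)
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.99)
conf_int
pred_int
# The prediction interval contains the confidence interval.
```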
cor(auto$mpg, auto$horsepower, use = "pairwise.complete.obs"), which excludes entries whenever one of the entries is NA. Here \(R^2=0.6059\).
cor(auto$mpg,auto$horsepower, use = "pairwise.complete.obs")
## [1] -0.7784268
cor(auto$mpg,auto$horsepower, use = "pairwise.complete.obs")^2
## [1] 0.6059483
library(ggplot2)
ggplot(auto,aes(horsepower,mpg))+geom_point(na.rm=T)+geom_smooth(method="lm",na.rm=T,se=F)
## `geom_smooth()` using formula 'y ~ x'
# Alternative plotting
#plot(auto$horsepower,auto$mpg)
#abline(model1)
ggfortify which aids plotting linear models with ggplot2. Use the following two commands in R to produce diagnostic plots of the linear regression fit:
> library(ggfortify)
> autoplot(your_model_name)
library(ggfortify)
autoplot(model1,label.size = 3)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# Alternative plotting without ggplot
#layout(matrix(1:4,2,2))
#plot(model1)
Auto dataset building on Question 1. One option is pairs(your_data_name). Instead we use the package psych and its pairs.panels function.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
pairs.panels(auto,ellipses = F, lm =T, breaks=10, hist.col="blue")
cor(). You need to exclude the name variable, which is qualitative.
Answer. We use subset(auto, select=-c(name)) to remove name from the data frame. The presence of NA values in the variable horsepower makes all the correlations with horsepower NA.
str(auto)
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
auto1<-subset(auto,select=-c(name))
str(auto1)
## 'data.frame': 397 obs. of 8 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
cor(auto1)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7762599 -0.8044430 NA -0.8317389
## cylinders -0.7762599 1.0000000 0.9509199 NA 0.8970169
## displacement -0.8044430 0.9509199 1.0000000 NA 0.9331044
## horsepower NA NA NA 1 NA
## weight -0.8317389 0.8970169 0.9331044 NA 1.0000000
## acceleration 0.4222974 -0.5040606 -0.5441618 NA -0.4195023
## year 0.5814695 -0.3467172 -0.3698041 NA -0.3079004
## origin 0.5636979 -0.5649716 -0.6106643 NA -0.5812652
## acceleration year origin
## mpg 0.4222974 0.5814695 0.5636979
## cylinders -0.5040606 -0.3467172 -0.5649716
## displacement -0.5441618 -0.3698041 -0.6106643
## horsepower NA NA NA
## weight -0.4195023 -0.3079004 -0.5812652
## acceleration 1.0000000 0.2829009 0.2100836
## year 0.2829009 1.0000000 0.1843141
## origin 0.2100836 0.1843141 1.0000000
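The NA rows can be avoided by telling cor() how to treat missing values. A sketch on a small made-up matrix with one missing entry:

```r
# use = "pairwise.complete.obs" drops only the incomplete pairs for each
# correlation, instead of propagating NA through the whole row/column.
m <- cbind(a = c(1, 2, 3, 4, 5),
           b = c(2, 4, NA, 8, 10),
           c = c(5, 3, 4, 1, 2))
cor(m)                                  # correlations involving b are NA
cor(m, use = "pairwise.complete.obs")   # all entries defined
# For the Auto data: cor(auto1, use = "pairwise.complete.obs")
```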
mpg as the response and all other variables except name as the predictors. Comment on the output by answering the following questions:
Answer. displacement, weight, year, and origin are statistically significant at the 1% level.
What does the coefficient for the year variable suggest?
Answer. The coefficient of year is positive and its p-value is close to 0. This shows that year is positively related to mpg: each additional model year adds 0.7508 miles per gallon, everything else staying the same.
model2 <- lm(mpg~., data=auto1)
summary(model2)
##
## Call:
## lm(formula = mpg ~ ., data = auto1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
set.seed(1)
x1 <- runif(100)
x2 <- 0.5*x1 + rnorm(100)/10
y <- 2 + 2*x1 + 0.3*x2 + rnorm(100)
The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
Answer. The form of the linear model is: \[ y = \beta_0+ \beta_1 x_1 + \beta_2 x_2 +\epsilon\] where the coefficients are \(\beta_0=2, \beta_1=2, \beta_2=0.3\).
x1 and x2? Create a scatterplot displaying the relationship between the variables.
Answer. The correlation is 0.8351. This clearly shows a strong positive correlation between x1 and x2 (as also seen in the scatterplot).
cor(x1,x2)
## [1] 0.8351212
ggplot(data.frame(x1, x2), aes(x1, x2)) + geom_point()
y using x1 and x2.
model3 <- lm(y~x1+x2)
summary(model3)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
y using only x1.
model4 <- lm(y~x1)
summary(model4)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
y using only x2.
model5 <- lm(y~x2)
summary(model5)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
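The collinearity between x1 and x2 seen above can be quantified with the variance inflation factor, \(\mathrm{VIF}_j = 1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor j on the other predictors. A base-R sketch using the same simulation (values substantially above 1 indicate collinearity):

```r
# Variance inflation factor computed directly from its definition:
# regress each predictor on the others and use that R-squared.
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
r2  <- summary(lm(x2 ~ x1))$r.squared  # R^2 of x2 on x1 (symmetric here)
vif <- 1 / (1 - r2)
vif
```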
x1 and x2. In doing multiple regression we see this effect: it is difficult to reject \(H_0:\beta_j=0\) for one of the coefficients, while with a simple regression on either variable alone we can reject \(H_0:\beta_j=0\). This is caused by multicollinearity.
Boston dataset. This data was part of an important paper in 1978 by Harrison and Rubinfeld titled "Hedonic housing prices and the demand for clean air" published in the Journal of Environmental Economics and Management 5(1): 81-102. The dataset has the following fields:
* crim: per capita crime rate by town
* zn: proportion of residential land zoned for lots over 25,000 sq.ft
* indus: proportion of non-retail business acres per town
* chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* nox: nitrogen oxides concentration (parts per 10 million)
* rm: average number of rooms per dwelling
* age: proportion of owner-occupied units built prior to 1940
* dis: weighted mean of distances to five Boston employment centres
* rad: index of accessibility to radial highways
* tax: full-value property-tax rate per $10,000
* ptratio: pupil-teacher ratio by town
* black: \(1000(Bk-0.63)^2\) where Bk is the proportion of black residents by town
* lstat: lower status of the population (percent)
* medv: median value of owner-occupied homes in $1000s
model1 and model13 below. It should be done for all models to check that the p-values are close to zero for testing \(H_0: \beta_j=0\) for \(j=1,\ldots,13\).
boston <- read.csv("Boston.csv")
colnames(boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
model1 <- lm(medv~crim, data=boston)
model2 <- lm(medv~zn, data=boston)
model3 <- lm(medv~indus, data=boston)
model4 <- lm(medv~chas, data=boston)
model5 <- lm(medv~nox, data=boston)
model6 <- lm(medv~rm, data=boston)
model7 <- lm(medv~age, data=boston)
model8 <- lm(medv~dis, data=boston)
model9 <- lm(medv~rad, data=boston)
model10 <- lm(medv~tax, data=boston)
model11 <- lm(medv~ptratio, data=boston)
model12 <- lm(medv~black, data=boston)
model13 <- lm(medv~lstat, data=boston)
summary(model1)
##
## Call:
## lm(formula = medv ~ crim, data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.957 -5.449 -2.007 2.512 29.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.03311 0.40914 58.74 <2e-16 ***
## crim -0.41519 0.04389 -9.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.484 on 504 degrees of freedom
## Multiple R-squared: 0.1508, Adjusted R-squared: 0.1491
## F-statistic: 89.49 on 1 and 504 DF, p-value: < 2.2e-16
# Verify this for all the models by checking that the p-values are close to 0.
summary(model13)
##
## Call:
## lm(formula = medv ~ lstat, data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
ggplot(boston,aes(lstat,medv))+geom_point(na.rm=T)+geom_smooth(method="lm",na.rm=T,se=F)
## `geom_smooth()` using formula 'y ~ x'
crim, zn, chas, nox, rm, dis, rad, tax, ptratio, black, lstat at the 0.05 significance level.
modelall <- lm(medv~., data=boston)
summary(modelall)
##
## Call:
## lm(formula = medv ~ ., data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Ind <- c(model1$coef[2], model2$coef[2], model3$coef[2], model4$coef[2], model5$coef[2],
model6$coef[2], model7$coef[2], model8$coef[2], model9$coef[2], model10$coef[2],
model11$coef[2], model12$coef[2], model13$coef[2])
All <- modelall$coef[2:14]
ggplot(data.frame(Ind, All), aes(Ind, All)) + geom_point() + geom_smooth(method="lm", se=F) + ggtitle("Coefficient relationship") + xlab("Simple linear regression") + ylab("Multiple linear regression")
## `geom_smooth()` using formula 'y ~ x'
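The thirteen simple regressions above can also be fit in one pass with sapply, which avoids naming model1 through model13 by hand. A sketch, illustrated on the built-in mtcars data so it runs stand-alone:

```r
# Fit a simple regression of the response on each predictor in turn and
# collect the slope coefficients, instead of writing one lm() call per model.
response   <- "mpg"
predictors <- setdiff(names(mtcars), response)
slopes <- sapply(predictors, function(p) {
  coef(lm(reformulate(p, response), data = mtcars))[[2]]
})
slopes
# For the Boston data the same pattern would be:
# sapply(setdiff(names(boston), "medv"),
#        function(p) coef(lm(reformulate(p, "medv"), data = boston))[[2]])
```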
lstat predictor variable and the response? To answer the question, fit a model of the form medv = \(\beta_0\) + \(\beta_1\)lstat + \(\beta_2\)lstat\(^2\) + \(\epsilon\). You may use the poly() function in R. Does this help improve the fit? Add higher degree polynomials to fit the data. What is the degree of the polynomial fit beyond which the terms no longer remain significant?
Answer. We plot (lstat, medv) along with the linear (blue curve) and polynomial of degree 5 (red curve) fits below.
summary(model13)
##
## Call:
## lm(formula = medv ~ lstat, data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
modelpoly2 <- lm(medv~poly(lstat,2,raw=TRUE), data = boston)
summary(modelpoly2)
##
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## poly(lstat, 2, raw = TRUE)1 -2.332821 0.123803 -18.84 <2e-16 ***
## poly(lstat, 2, raw = TRUE)2 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
modelpoly3 <- lm(medv~poly(lstat,3,raw=TRUE), data = boston)
summary(modelpoly3)
##
## Call:
## lm(formula = medv ~ poly(lstat, 3, raw = TRUE), data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5441 -3.7122 -0.5145 2.4846 26.4153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.6496253 1.4347240 33.909 < 2e-16 ***
## poly(lstat, 3, raw = TRUE)1 -3.8655928 0.3287861 -11.757 < 2e-16 ***
## poly(lstat, 3, raw = TRUE)2 0.1487385 0.0212987 6.983 9.18e-12 ***
## poly(lstat, 3, raw = TRUE)3 -0.0020039 0.0003997 -5.013 7.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.396 on 502 degrees of freedom
## Multiple R-squared: 0.6578, Adjusted R-squared: 0.6558
## F-statistic: 321.7 on 3 and 502 DF, p-value: < 2.2e-16
modelpoly4 <- lm(medv~poly(lstat,4,raw=TRUE), data = boston)
summary(modelpoly4)
##
## Call:
## lm(formula = medv ~ poly(lstat, 4, raw = TRUE), data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.563 -3.180 -0.632 2.283 27.181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.731e+01 2.280e+00 25.134 < 2e-16 ***
## poly(lstat, 4, raw = TRUE)1 -7.028e+00 7.308e-01 -9.618 < 2e-16 ***
## poly(lstat, 4, raw = TRUE)2 4.955e-01 7.489e-02 6.616 9.50e-11 ***
## poly(lstat, 4, raw = TRUE)3 -1.631e-02 2.994e-03 -5.448 7.98e-08 ***
## poly(lstat, 4, raw = TRUE)4 1.949e-04 4.043e-05 4.820 1.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.28 on 501 degrees of freedom
## Multiple R-squared: 0.673, Adjusted R-squared: 0.6704
## F-statistic: 257.8 on 4 and 501 DF, p-value: < 2.2e-16
modelpoly5 <- lm(medv~poly(lstat,5,raw=TRUE), data = boston)
summary(modelpoly5)
##
## Call:
## lm(formula = medv ~ poly(lstat, 5, raw = TRUE), data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.770e+01 3.604e+00 18.783 < 2e-16 ***
## poly(lstat, 5, raw = TRUE)1 -1.199e+01 1.526e+00 -7.859 2.39e-14 ***
## poly(lstat, 5, raw = TRUE)2 1.273e+00 2.232e-01 5.703 2.01e-08 ***
## poly(lstat, 5, raw = TRUE)3 -6.827e-02 1.438e-02 -4.747 2.70e-06 ***
## poly(lstat, 5, raw = TRUE)4 1.726e-03 4.167e-04 4.143 4.03e-05 ***
## poly(lstat, 5, raw = TRUE)5 -1.632e-05 4.420e-06 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
modelpoly6 <- lm(medv~poly(lstat,6,raw=TRUE), data = boston)
summary(modelpoly6)
##
## Call:
## lm(formula = medv ~ poly(lstat, 6, raw = TRUE), data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7317 -3.1571 -0.6941 2.0756 26.8994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.304e+01 5.593e+00 13.059 < 2e-16 ***
## poly(lstat, 6, raw = TRUE)1 -1.517e+01 2.965e+00 -5.115 4.49e-07 ***
## poly(lstat, 6, raw = TRUE)2 1.930e+00 5.713e-01 3.378 0.000788 ***
## poly(lstat, 6, raw = TRUE)3 -1.307e-01 5.202e-02 -2.513 0.012295 *
## poly(lstat, 6, raw = TRUE)4 4.686e-03 2.407e-03 1.947 0.052066 .
## poly(lstat, 6, raw = TRUE)5 -8.416e-05 5.450e-05 -1.544 0.123186
## poly(lstat, 6, raw = TRUE)6 5.974e-07 4.783e-07 1.249 0.212313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.212 on 499 degrees of freedom
## Multiple R-squared: 0.6827, Adjusted R-squared: 0.6789
## F-statistic: 178.9 on 6 and 499 DF, p-value: < 2.2e-16
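An alternative to eyeballing the coefficient p-values is a nested-model F-test with anova(). A sketch on the built-in mtcars data, assuming the same raw-polynomial idiom as above:

```r
# Compare nested polynomial fits with an F-test: a small p-value means
# the higher-degree model explains significantly more variation.
fit1 <- lm(mpg ~ poly(hp, 1, raw = TRUE), data = mtcars)
fit2 <- lm(mpg ~ poly(hp, 2, raw = TRUE), data = mtcars)
fit3 <- lm(mpg ~ poly(hp, 3, raw = TRUE), data = mtcars)
anova(fit1, fit2, fit3)
# anova(model13, modelpoly2, modelpoly3, modelpoly4, modelpoly5, modelpoly6)
# would run the analogous comparison on the Boston fits.
```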
boston$pr1 <- predict(model13,newdata=boston)
boston$pr5 <- predict(modelpoly5,newdata=boston)
ggplot(boston)+geom_point(aes(lstat,medv))+geom_line(aes(lstat,pr1),color="blue",size=2)+geom_line(aes(lstat,pr5),color="red",linetype="solid",size=2)
winedata.csv. The dataset contains the following variables:
* vintage: year the wine was made
* price91: 1991 auction prices for the wine in dollars
* price92: 1992 auction prices for the wine in dollars
* temp: average temperature during the growing season in degrees Celsius
* hrain: total harvest rain in mm
* wrain: total winter rain in mm
* tempdiff: sum of the difference between the maximum and minimum temperatures during the growing season in degrees Celsius
Create two new variables age91 and age92 that capture the age of the wine (in years) at the time of the auctions. For example, a 1961 wine would have an age of 30 at the auction in 1991. What is the average price of wines that were 15 years or older at the time of the 1991 auction?
wine<-read.csv("winedata.csv")
str(wine)
## 'data.frame': 25 obs. of 7 variables:
## $ vintage : int 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 ...
## $ price91 : num 131.5 156 118 92.8 119.6 ...
## $ price92 : num 132 181 180 115 157 ...
## $ temp : num 18.8 18.8 18 NA 18.6 ...
## $ hrain : num 20.3 41 42.1 41.1 0.5 ...
## $ wrain : num 328 411 453 372 374 ...
## $ tempdiff: num 7.36 7.21 6.84 7.76 8.38 7.26 6.88 7.2 6.58 7.47 ...
wine$age91<-1991-wine$vintage
wine$age92<-1992-wine$vintage
mean(subset(wine$price91,wine$age91>=15))
## [1] 96.43563
mean(subset(wine$price91,wine$hrain<mean(wine$hrain)&wine$tempdiff<mean(wine$tempdiff)))
## [1] 72.86714
train<-subset(wine,vintage<=1981)
model1<-lm(log(price91)~age91,data=train)
summary(model1)
##
## Call:
## lm(formula = log(price91) ~ age91, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26897 -0.13328 0.01939 0.10452 0.41913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.571442 0.144163 24.774 6.30e-16 ***
## age91 0.042610 0.006899 6.176 6.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1914 on 19 degrees of freedom
## Multiple R-squared: 0.6675, Adjusted R-squared: 0.65
## F-statistic: 38.15 on 1 and 19 DF, p-value: 6.188e-06
intercept (\(\beta_0\)): [3.159, 3.984]. age91 (\(\beta_1\)): [0.023, 0.062].
confint(model1, level = 0.99)
## 0.5 % 99.5 %
## (Intercept) 3.15900093 3.98388288
## age91 0.02287254 0.06234705
test<-subset(wine,vintage>=1982)
predtest<-predict(model1,newdata=test)
sse<-sum((log(test$price91)-predtest)^2)
sst<-sum((log(test$price91)-mean(log(train$price91)))^2)
testR2<- 1-sse/sst
testR2
## [1] 0.9213742
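The out-of-sample \(R^2\) computation above generalizes to a small helper. A sketch, noting that the baseline mean is taken from the training response, so the test \(R^2\) can be negative when the model predicts worse than that constant:

```r
# Out-of-sample R^2: 1 - SSE/SST, with SST measured against the
# *training* mean rather than the test mean.
test_r2 <- function(actual, predicted, train_mean) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - train_mean)^2)
  1 - sse / sst
}
# Hypothetical usage mirroring the wine example:
# test_r2(log(test$price91), predtest, mean(log(train$price91)))
# A perfect prediction gives exactly 1:
test_r2(c(1, 2, 3), c(1, 2, 3), 2)  # → 1
```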
age91, temp, hrain, wrain, tempdiff) in the training dataset. To fit your model, use the data for wines made up to (and including) the year 1981. What is the R-squared for the model?
model2<-lm(log(price91)~temp+hrain+wrain+tempdiff+age91,data=train)
summary(model2)
##
## Call:
## lm(formula = log(price91) ~ temp + hrain + wrain + tempdiff +
## age91, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32298 -0.07019 -0.01634 0.11051 0.24455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2626138 1.4914578 1.517 0.1532
## temp 0.1135015 0.0734344 1.546 0.1462
## hrain -0.0028825 0.0016205 -1.779 0.0987 .
## wrain 0.0002520 0.0003918 0.643 0.5313
## tempdiff -0.1213280 0.1445947 -0.839 0.4166
## age91 0.0482331 0.0075346 6.402 2.34e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1803 on 13 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7938, Adjusted R-squared: 0.7145
## F-statistic: 10.01 on 5 and 13 DF, p-value: 0.000421
Is this model preferred to the model with only the age variable as a predictor (use the adjusted R-squared for the model to decide on this)?
Answer. With only the age variable, adjusted \(R^2=0.65\). On the other hand, with all the variables, adjusted \(R^2=0.7145\). This seems to indicate that the latter model (with more variables included) is preferred.
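For reference, the adjusted \(R^2\) trades fit against model size through the degrees of freedom:
\[ \bar{R}^2 = 1-(1-R^2)\frac{n-1}{n-p-1} \]
where \(n\) is the number of observations and \(p\) the number of predictors, so \(\bar{R}^2\) increases only when a new variable improves the fit by more than would be expected by chance.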
Which among the following best describes the output from the fitted model?
Of the five variables (age91, temp, hrain, wrain, tempdiff), drop the two variables that are the least significant from the results in (g). Rerun the linear regression and write down your fitted model.
Answer. The least significant variables are wrain and tempdiff, with p-values 0.53 and 0.42 respectively, and we create model3 by removing these two.
model3<-lm(log(price91)~temp+hrain+age91,data=train)
summary(model3)
##
## Call:
## lm(formula = log(price91) ~ temp + hrain + age91, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32198 -0.09905 0.00491 0.14536 0.29828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.801013 1.297697 1.388 0.185
## temp 0.097500 0.068750 1.418 0.177
## hrain -0.001983 0.001236 -1.604 0.130
## age91 0.045670 0.006702 6.814 5.85e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1752 on 15 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7753, Adjusted R-squared: 0.7304
## F-statistic: 17.26 on 3 and 15 DF, p-value: 3.973e-05
Is this model preferred to the model with all variables as predictors (use the adjusted R-squared in the training set to decide on this)?
Answer. In the training set, adjusted \(R^2\) for this model is 0.73, while for model2, adjusted \(R^2\) is 0.7145. In this case, the new model3 is preferred to model2.
Using the variables identified in (j), construct a multiple regression model to fit the log of the price at which the wine was auctioned in 1992 (remember to use age92 instead of age91). To fit your model, use the data for wines made up to (and including) the year 1981. What is the R-squared for the model?
Answer. \(R^2\) for this model is 0.5834.
model4<-lm(log(price92)~temp+hrain+age92,data=train)
summary(model4)
##
## Call:
## lm(formula = log(price92) ~ temp + hrain + age92, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29394 -0.15545 0.03238 0.10221 0.37661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.723393 1.574961 1.729 0.104296
## temp 0.072494 0.083333 0.870 0.398043
## hrain -0.001539 0.001498 -1.027 0.320713
## age92 0.035237 0.008124 4.338 0.000586 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2123 on 15 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.5834, Adjusted R-squared: 0.5001
## F-statistic: 7.002 on 3 and 15 DF, p-value: 0.003621
Suppose in this application, we assume that a variable is statistically significant at the 0.2 level. Would you reject the hypothesis that the coefficient for the variable hrain is zero?
Answer. The p-value for hrain is 0.32. Hence we cannot reject the null hypothesis that the coefficient for hrain is zero.
By separately estimating the equations for the wine prices for each auction, we can better establish the credibility of the explanatory variables because:
The current fit of the linear regression using the weather variables drops all observations where any of the entries are missing. Provide a short explanation on when this might not be a reasonable approach to use.
Answer. Dropping rows with missing entries is simple and works reasonably when few values are missing at random. However, if there are many missing entries we can lose a lot of data, and if the missingness is systematic rather than random, the remaining sample may no longer be representative.
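When dropping rows is too wasteful, a common (if crude) alternative is to impute missing values, for example with the column mean. A minimal sketch (the `temp` vector below is a made-up stand-in mirroring the wine data's temperature column):

```r
# Mean imputation: replace NA with the mean of the observed values.
# This keeps every row but shrinks the variable's variance, so it should
# be used with care; model-based imputation is generally preferred.
temp <- c(18.8, 18.8, 18.0, NA, 18.6)      # hypothetical column with a gap
temp[is.na(temp)] <- mean(temp, na.rm = TRUE)
temp
```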
iris dataset. The dataset is available in R.
dim(iris)
## [1] 150 5
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
iris.data by removing the Species column, and store the Species column as iris.sp.
iris.data<-iris[,-5]
iris.sp<-iris[,5]
psych and its pairs.panels function.
library(psych)
pairs.panels(iris.data, ellipses = F, lm =T, breaks=10, hist.col="blue")
iris.data without standardizing the data. You may use prcomp(..., scale=F).
pr.out<-prcomp(iris.data,scale=F)
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 2.0563 0.49262 0.2797 0.15439
## Proportion of Variance 0.9246 0.05307 0.0171 0.00521
## Cumulative Proportion 0.9246 0.97769 0.9948 1.00000
pr.out$sdev
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
pve<-pr.out$sdev^2/sum(pr.out$sdev^2)
cpve<-cumsum(pve)
pve
## [1] 0.924618723 0.053066483 0.017102610 0.005212184
cpve
## [1] 0.9246187 0.9776852 0.9947878 1.0000000
plot(cpve,xlab="Principal components",type="l",ylim=c(0.7,1))
In general it is suggested that data be scaled prior to conducting PCA.
ii) Plot the data along the first two principal components and color the different species separately. Does the first principal component create enough separation among the different species? To plot, you may use the function fviz_pca_ind or fviz_pca_biplot in library(factoextra). Alternatively, you may use biplot or construct a plot using ggplot2 as well.
Answer. The first principal component already seems to create a separation between setosa and the other two categories, although versicolor seems to be quite close to virginica. Moreover, it appears the same proportions of Petal length and Petal width are used in both PC1 and PC2 (the only difference is the magnitude). We can also create 95% confidence bounds of ellipses around the groups to see how much they are separated.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_pca_biplot(pr.out, label = "var", habillage=iris.sp)
#fviz_pca_biplot(pr.out, label = "var", habillage=iris.sp,addEllipses=TRUE, ellipse.level=0.95)
When data is standardized, one guideline for picking the number of relevant principal components is to keep the ones with eigenvalues (the variances of the components) greater than 1, meaning they explain at least as much variance as one original variable. Check that the sum of the eigenvalues (variances) equals the number of variables (four here).
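This check is quick to verify: with \(p\) standardized variables each has variance 1, so the eigenvalues must sum to \(p\). A sketch on the built-in iris data:

```r
# After standardizing, total variance equals the number of variables,
# and PCA merely redistributes it across the components.
pr <- prcomp(iris[, 1:4], scale = TRUE)
eigenvalues <- pr$sdev^2
sum(eigenvalues)  # → 4
```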
From the plots we can see that after standardizing, the second principal component also contributes to explaining variability. Since the petal lengths were measured on a larger scale, the non-standardized version gave much more importance to this attribute. In the standardized version both petal length and petal width tend to get the same importance (they are also quite correlated). Moreover, sepal width gets more prominent weights in principal components 1 and 2. With the first two components having variances 2.92 and 0.91 (standard deviations 1.71 and 0.96, the second just below the eigenvalue-1 cutoff), it is natural to choose two principal components here to explain the data variability.
pr.out.sc<-prcomp(iris.data,scale=T)
summary(pr.out.sc)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
pr.out.sc$sdev
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
sum((pr.out.sc$sdev)^2)
## [1] 4
plot(pr.out.sc$sdev^2)
abline(h=1,col="red")
pve.sc<-pr.out.sc$sdev^2/sum(pr.out.sc$sdev^2)
cpve.sc<-cumsum(pve.sc)
pve.sc
## [1] 0.729624454 0.228507618 0.036689219 0.005178709
cpve.sc
## [1] 0.7296245 0.9581321 0.9948213 1.0000000
plot(cpve.sc,xlab="Principal components",type="l",ylim=c(0.4,1))
# compare with the unscaled case:
#plot(cpve,xlab="Principal components",type="l",ylim=c(0.4,1))
library(factoextra)
fviz_pca_biplot(pr.out.sc, label = "var", habillage=iris.sp)
fviz_pca_biplot(pr.out.sc, label = "var", habillage=iris.sp,addEllipses=TRUE, ellipse.level=0.95)
7. This problem involves the dataset wine_italy.csv, which was obtained from the University of California, Irvine (UCI) Machine Learning Repository. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. The first column identifies the cultivar and the next thirteen are the attributes given by:
alcohol: Alcohol
malic: Malic acid
ash: Ash
alkalin: Alkalinity of ash
mag: Magnesium
phenols: Total phenols
flavanoids: Flavanoids
nonflavanoids: Nonflavanoid phenols
proanth: Proanthocyanins
color: Color intensity
hue: Hue
od280: OD280/OD315 of diluted wines
proline: Proline
wineitaly<-read.csv("wine_italy.csv",sep=",",head=T)
str(wineitaly)
## 'data.frame': 178 obs. of 14 variables:
## $ class : int 1 1 1 1 1 1 1 1 1 1 ...
## $ alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
## $ malic : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ alkalin : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
## $ phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ nonflavanoids: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ proanth : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ color : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ od280 : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
wine.it<-wineitaly[,-1]
wine.cl<-wineitaly[,1]
library(psych) # pairs.panels() comes from the psych package
pairs.panels(wine.it, ellipses = F, lm = T, breaks = 10, hist.col = "blue")
b. Conduct a principal component analysis on the standardized data. What proportion of the total variability in the data is explained by the first two components?
Answer. The first two components explain about 55% of the total variation (cumulative proportion 0.554).
pr.wine<-prcomp(wine.it,scale=T)
summary(pr.wine)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
## Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
## Cumulative Proportion 0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
## Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
## Cumulative Proportion 0.92018 0.94240 0.9617 0.97907 0.99205 1.00000
pr.wine$sdev
## [1] 2.1692972 1.5801816 1.2025273 0.9586313 0.9237035 0.8010350 0.7423128
## [8] 0.5903367 0.5374755 0.5009017 0.4751722 0.4108165 0.3215244
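As a quick sanity check, here is a standalone sketch using the standard deviations printed above: for standardized data the eigenvalues should sum to the number of variables (13), and the first two components should account for about 55% of the variance.

```r
# Component standard deviations as reported by pr.wine$sdev above
sdev <- c(2.1692972, 1.5801816, 1.2025273, 0.9586313, 0.9237035, 0.8010350,
          0.7423128, 0.5903367, 0.5374755, 0.5009017, 0.4751722, 0.4108165,
          0.3215244)
ev <- sdev^2            # eigenvalues (component variances)
sum(ev)                 # about 13, the number of standardized variables
sum(ev[1:2]) / sum(ev)  # about 0.554: ~55% explained by PC1 and PC2
```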
#library(factoextra)
fviz_pca_biplot(pr.wine, label = "var", habillage=wine.cl)
fviz_pca_biplot(pr.wine, label = "var", habillage=wine.cl,addEllipses=TRUE, ellipse.level=0.95)
i) Which two key attributes differentiate Cultivar 2 from the other two cultivars?
Answer. From the principal component loadings for each attribute, we can see that Cultivar 2 (green triangles) has lighter color intensity and lower alcohol content (it plots opposite to the directions of these attributes in the PC1-PC2 plane). It also has a higher OD280/OD315 ratio and stronger hue than the other cultivars (but these weights are a little smaller than the first two).
pr.wine$rotation[order(pr.wine$rotation[,1],decreasing=T),1:2] ## order according to PC1
## PC1 PC2
## nonflavanoids 0.298533103 0.028779488
## malic 0.245187580 0.224930935
## alkalin 0.239320405 -0.010590502
## color 0.088616705 0.529995672
## ash 0.002051061 0.316068814
## magnesium -0.141992042 0.299634003
## alcohol -0.144329395 0.483651548
## proline -0.286752227 0.364902832
## hue -0.296714564 -0.279235148
## proanth -0.313429488 0.039301722
## od280 -0.376167411 -0.164496193
## phenols -0.394660845 0.065039512
## flavanoids -0.422934297 -0.003359812
pr.wine$rotation[order(pr.wine$rotation[,2],decreasing=T),1:2] ## order according to PC2
## PC1 PC2
## color 0.088616705 0.529995672
## alcohol -0.144329395 0.483651548
## proline -0.286752227 0.364902832
## ash 0.002051061 0.316068814
## magnesium -0.141992042 0.299634003
## malic 0.245187580 0.224930935
## phenols -0.394660845 0.065039512
## proanth -0.313429488 0.039301722
## nonflavanoids 0.298533103 0.028779488
## flavanoids -0.422934297 -0.003359812
## alkalin 0.239320405 -0.010590502
## od280 -0.376167411 -0.164496193
## hue -0.296714564 -0.279235148
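The verbal reading of these tables can also be automated: rank the attributes by absolute loading to find which dominate a component. A sketch using the PC1 loadings printed above (hardcoded and rounded here so the snippet is self-contained):

```r
# PC1 loadings as printed by pr.wine$rotation above (rounded)
pc1 <- c(nonflavanoids = 0.2985, malic = 0.2452, alkalin = 0.2393,
         color = 0.0886, ash = 0.0021, magnesium = -0.1420,
         alcohol = -0.1443, proline = -0.2868, hue = -0.2967,
         proanth = -0.3134, od280 = -0.3762, phenols = -0.3947,
         flavanoids = -0.4229)
# attributes with the largest absolute weights dominate PC1
names(sort(abs(pc1), decreasing = TRUE))[1:3]
# "flavanoids" "phenols" "od280"
```

The same trick applied to the PC2 column picks out color and alcohol, matching the interpretation above.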
ii) Which two key attributes differentiate Cultivar 3 from the other two cultivars?
Answer. Similarly, we can see that Cultivar 3 (blue squares) has a higher content of malic acid and nonflavanoid phenols, as well as higher alkalinity, especially relative to Cultivar 1.
pve.w<- pr.wine$sdev^2/sum(pr.wine$sdev^2)
cpve.w<- cumsum(pve.w)
pve.w
## [1] 0.361988481 0.192074903 0.111236305 0.070690302 0.065632937 0.049358233
## [7] 0.042386793 0.026807489 0.022221534 0.019300191 0.017368357 0.012982326
## [13] 0.007952149
cpve.w
## [1] 0.3619885 0.5540634 0.6652997 0.7359900 0.8016229 0.8509812 0.8933680
## [8] 0.9201754 0.9423970 0.9616972 0.9790655 0.9920479 1.0000000
plot(cpve.w,xlab="Principal components",type="l",ylim=c(0,1))
abline(h=0.8,col="red")
abline(v=5,col="blue")
plot(pr.wine,type="l",ylim=c(0,5),main="Scree plot")
abline(h=1,col="red")